I have applied a range of statistical methods and developed software platforms and analytical tools to process and analyze large-scale sequencing datasets. These tools facilitate the discovery of patterns and similarities across diverse omics datasets, enabling the construction of statistical models that support robust hypothesis generation.
In various projects, I have integrated data from multiple sources. Examples of the methods I use include:
Multi-task Regression Methods: These methods model multiple related outputs simultaneously, allowing data from several sources to be handled within a single model. In the context of omics data, multi-task regression can capture patterns shared across related datasets, improving predictive accuracy and providing insights into interconnected biological processes.
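As a minimal sketch of this idea (on synthetic data standing in for omics features, not an actual analysis pipeline), scikit-learn's MultiTaskLasso fits all outputs jointly with a shared sparsity pattern, so a feature selected for one task is selected for all related tasks:

```python
import numpy as np
from sklearn.linear_model import MultiTaskLasso

# Synthetic stand-in for omics data: 100 samples, 20 features,
# 3 related outputs that share the same 5 informative features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
W = np.zeros((20, 3))
W[:5, :] = rng.normal(size=(5, 3))          # shared support across tasks
Y = X @ W + 0.1 * rng.normal(size=(100, 3))

# MultiTaskLasso applies a joint (group) penalty across outputs, so the
# fitted coefficient matrix has a common row-sparsity pattern.
model = MultiTaskLasso(alpha=0.1).fit(X, Y)
shared_support = np.any(model.coef_ != 0, axis=0)  # coef_ shape: (n_tasks, n_features)
```

The joint penalty is what distinguishes this from fitting three independent Lasso models: information about which features matter is pooled across tasks.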
Classification Methods: I employ a variety of classical classification techniques, such as logistic regression, decision trees, and support vector machines (SVMs), to categorize data based on features derived from omics datasets. These methods are essential for distinguishing between different biological states, such as healthy vs. diseased samples or cell type classifications. By applying these well-established algorithms, I can create predictive models that provide insights into underlying biological processes and help identify potential biomarkers. Additionally, I have incorporated positive-unlabeled learning when working with partially labeled data, allowing for the effective classification of samples in scenarios where only a subset has known annotations.
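A common positive-unlabeled scheme (one sketch among several; the data and parameters below are illustrative, not from an actual project) trains a classifier to separate labeled positives from unlabeled samples, then rescales its probabilities by an estimate of the labeling rate c = P(labeled | positive), estimated on held-out labeled positives:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic partially labeled data: y_true is hidden; only ~40% of the
# true positives carry a label (s = 1), everything else is unlabeled.
rng = np.random.default_rng(1)
X = rng.normal(size=(600, 5))
y_true = (X[:, 0] + X[:, 1] > 0).astype(int)
s = ((y_true == 1) & (rng.random(600) < 0.4)).astype(int)

# Step 1: fit labeled-vs-unlabeled, holding out data to estimate c.
X_tr, X_ho, s_tr, s_ho = train_test_split(X, s, test_size=0.25, random_state=0)
clf = LogisticRegression().fit(X_tr, s_tr)

# Step 2: c = mean predicted score on held-out labeled positives,
# then rescale to approximate P(y = 1 | x) for every sample.
c = clf.predict_proba(X_ho[s_ho == 1])[:, 1].mean()
p_positive = np.clip(clf.predict_proba(X)[:, 1] / c, 0.0, 1.0)
```

The key assumption behind this rescaling is that labeled positives are a random sample of all positives; when that holds, the adjusted scores rank unlabeled samples by their probability of being positive.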
Unsupervised Methods (e.g., Autoencoders): Autoencoders are neural network models that learn compact, informative representations of high-dimensional data. In omics analysis, I leverage autoencoders to uncover hidden structures and reduce data dimensionality, facilitating the discovery of patterns in transcription factor activity and other biological features. These learned embeddings can further be used to cluster data, identify relationships, and detect anomalies.
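A minimal way to sketch this, assuming synthetic data with a low-dimensional latent structure (the real models and datasets are more elaborate), is to train scikit-learn's MLPRegressor to reconstruct its own input through a narrow hidden layer and read the embeddings off the encoder weights:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Synthetic high-dimensional profiles with a 2-D latent structure:
# 200 samples, 50 features generated from 2 hidden factors plus noise.
rng = np.random.default_rng(2)
Z = rng.normal(size=(200, 2))
X = Z @ rng.normal(size=(2, 50)) + 0.05 * rng.normal(size=(200, 50))

# An autoencoder as a regressor trained to reproduce its input (fit(X, X))
# through a 2-unit bottleneck; identity activation keeps it linear.
ae = MLPRegressor(hidden_layer_sizes=(2,), activation="identity",
                  solver="lbfgs", max_iter=2000, random_state=0)
ae.fit(X, X)

# The bottleneck activations are the learned low-dimensional embedding,
# usable downstream for clustering or anomaly detection.
embedding = X @ ae.coefs_[0] + ae.intercepts_[0]
```

With a linear bottleneck this recovers essentially the same subspace as PCA; nonlinear activations and deeper layers extend the same recipe to more complex structure.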